Skip to main content

Skill failures - high failure rate

Monitor - https://app.axiom.co/legion-security-ry4e/monitors/view/2U5LB6xfEgLn2QDXiG

Overview​

When autonomous workers run a step and encounter an error they retry running the step several times in case the problem is transient, like slow loading in the target tool. If a step fails to run after all retry attempts it indicates a problem that affects customer experience which we need to investigate, which is what this monitor detects. Specifically, a critical alert is created for high volume failing skills - a skill with high usage rate in the last couple of hours with a very high failure rate.

Alerts raised by this monitor likely indicate one of the following issues:

  1. There is a problem with the target tool - for example if the site is down, or some functionality within it is not working
  2. There is a problem with the relevant Legion skill - for example a change in the site made an action's selector outdated

Finding the problem​

  1. Run the monitor's query to see the success rate autonomous runs by skill id, and look for the skill(s) with a low success rate below the monitor threshold in the alerted time window.

    traces
    | where name == "run_step"
    | where ['resource.deployment.environment'] in ("staging", "production")
    | summarize Successes=countif(isempty(['status.code']) or ['status.code'] == "UNSET"), Failures=countif(['status.code'] == "ERROR"), Total=count(), DistinctInvestigations=dcount(['attributes.investigation_id']) by ['attributes.template_id'], ['attributes.step_tool']
    | extend SuccessRate=100 * todouble(Successes) / Total
    | where Total > 50 // High use skills
    | project ['attributes.template_id'], ['attributes.step_tool'], SuccessRate
    | where SuccessRate <= 50
  2. For each skill experiencing a high failure rate we need to find the root problem causing the failures. Run the following query to list the investigation ids of failed runs of the skill

    traces
    | where name == "run_step"
    | where ['resource.deployment.environment'] in ("staging", "production")
    | where ['status.code'] == "ERROR"
    | where ['attributes.template_id'] == "[skill id]"
    | project ['resource.cloud.region'], ['attributes.org_id'], org_name, ['attributes.investigation_id'], ['attributes.step_tool'], ['status.message']
  3. Take one or more investigation ids that experienced failures and open them in session viewer in the customer's region. In most cases the failed action in session viewer will contain enough data to find the problem and likely fix it. If that is not the case, try running skill tests for the skill (if they exist) which may fail on the same issue if it's a tool we have access to or a regression in the skill definition and not a change in the tool